ABSTRACT

     One of the most promising speech coders at bit rates of
9.6 to 4.8 kbits/s is CELP [1]. CELP has dominated the 9.6 to
4.8 kbits/s region during the past 3 to 4 years; its drawback,
however, is its expensive implementation. As an alternative to
CELP, we have developed the Base-Band CELP (CELP-BB) [2], which
produces good quality speech comparable to CELP with a
complexity that can be implemented on a single chip, as reported
in [3]. We have also improved its robustness so that it
tolerates errors up to 1.0% and maintains intelligibility up to
5.0% and more [4]. Although CELP-BB produces good quality speech
at around 4.8 kbits/s, it has a fundamental problem when
updating the pitch filter memory. We propose a sub-optimal
solution to this problem in this paper. Below 4.8 kbits/s,
however, CELP-BB suffers from noticeable quantization noise as a
result of the large vector dimensions used. In this paper,
therefore, we also report on the efficient representation of
speech below 4.8 kbits/s by introducing Sinusoidal Transform
Coding (STC) [5] to represent the LPC excitation; the resulting
coder is called Sine Wave Excited LPC (SWELP). In this case
natural sounding, good quality synthetic speech is obtained at
around 2.4 kbits/s.

1. INTRODUCTION 

     The choice of speech compression technique is very
important for maritime and land mobile satellite communication
systems. For these systems, the resources are very limited in
terms of the very small transceiver terminals, which require
larger satellite power, and the very restricted bandwidth
currently available. This especially applies to the land mobile
satellite service, which currently has only 4 MHz allocated for
primary service transmission. For such services to be economic,
they must employ a very narrow bandwidth per channel. The
competition is with analogue systems that employ Amplitude
Companded Single Side Band (ACSSB) and achieve reasonable
performance at C/N0 values of around 50 dB-Hz in a 5 kHz
transmitted bandwidth. In order to be competitive, and to use
modulation schemes that will not cause excessive distortion over
the difficult land mobile propagation channel, digital speech
coding at around 4.8 kbits/s or less is required. The
performance must be better than that of the analogue contender
in worst case degraded channels, and the speech quality must be
good enough for connection to PSTN transmission.

     With land mobile satellite systems (INMARSAT, AVSAT, MSAT,
etc.) proposing to operate voice services soon, the race is on
to produce digital speech coders that can meet all the
requirements. Until now, Analysis By Synthesis (ABS) schemes
such as CELP have only achieved these qualities down to
6 kbits/s. Their quality at 4.8 kbits/s can also be made
adequate by post filtering the recovered speech. However, below
4.8 kbits/s, most of these schemes suffer from noticeable
quantization noise as a result of the large vector dimensions
used. Another major disadvantage of these schemes is that they
are not really practical for real time implementation owing to
their enormous computational demand. For the land mobile service
we are looking for a speech coder whose cost is a fraction of
that of the mobile terminal; thus we need a scheme which is
simple to implement on a single DSP chip. CELP-BB [2] satisfies
these requirements; however, below 4.8 kbits/s it suffers from
noticeable quantization noise, as mentioned earlier. In this
paper we present results for a coder called Sine Wave Excited
LPC (SWELP), which we believe is capable of meeting all the
requirements and is thus a serious contender for future land
mobile satellite systems.

2. OVERVIEW OF CELP-BB SYSTEM

     In the CELP-BB system, a block of speech, S(n), is first
analyzed and 10 LPC coefficients are computed. These are
transformed into Line Spectral Frequencies (LSF) and then
quantized [4]. The quantized LSFs are inverse transformed into
LPC parameters, which are then used to form an inverse filter to
derive the LPC residual signal. The LPC residual signal is then
divided into sub-blocks, each of which is filtered separately by
the weighting filter (smoothing filter). Each filtered sub-block
is split into a number of sequences equal to the decimation
factor. These sequences are compared in terms of their energies
and the one with the highest energy is selected for
transmission. The position of the selected sequence in each
sub-block is also transmitted to the decoder so that the
sequence can be placed in the correct location during High
Frequency Regeneration (HFR). The selected decimated signal is
then treated as a continuous signal which contains fewer samples
than the original by a factor equal to the decimation factor.
This continuous signal is then quantized via a restricted ABS
procedure operating around a pitch synthesis filter only. First,
the pitch filter delay and gain for the continuous signal are
computed by cross correlation with the past decoded samples in
the pitch synthesis filter. Using these parameters in a pitch
synthesis filter, the memory response of the filter is computed
and subtracted from the decimated signal to form the reference
signal. Gaussian code-book sequences are then searched one by
one so that the output response of the pitch synthesis filter
with no memory matches the reference signal. The index of the
optimum sequence, together with its scale value, is transmitted
to the decoder. At the decoder, the selected code-book sequences
are scaled by their scale factors and passed through the pitch
synthesis filter in order to recover the continuous decimated
base-band signal. The recovered signal is then segmented into
sub-blocks and shifted to the correct positions, with zeros
inserted in between the samples, to form the LPC filter
excitation sequence. The LPC synthesis filter is excited with
this sequence to recover the output speech.
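
The sequence selection at the encoder and the zero insertion at
the decoder described above can be sketched as follows. This is
an illustration only: the function names, the toy sub-block and
the use of NumPy are assumptions made for the sketch and are not
taken from the CELP-BB implementation.

```python
import numpy as np

def select_decimated_sequence(sub_block, D):
    """Split a filtered sub-block into D interleaved sequences and keep
    the one with the highest energy. Returns (position, sequence)."""
    sub_block = np.asarray(sub_block, dtype=float)
    # Candidate j holds samples j, j+D, j+2D, ... of the sub-block.
    candidates = [sub_block[j::D] for j in range(D)]
    energies = [float(np.dot(c, c)) for c in candidates]
    position = int(np.argmax(energies))
    return position, candidates[position]

def regenerate_excitation(sequence, position, D, length):
    """Decoder side: put the samples back at their original positions
    with zeros in between (zero insertion)."""
    excitation = np.zeros(length)
    excitation[position::D] = sequence
    return excitation

# Toy sub-block with decimation factor 3 (values are made up).
sub_block = np.array([0.1, 2.0, -0.3, 0.2, -1.5, 0.1, 0.0, 1.0, 0.4])
position, seq = select_decimated_sequence(sub_block, D=3)
restored = regenerate_excitation(seq, position, D=3, length=len(sub_block))
```

Only the position and the selected samples need to be conveyed
per sub-block, which is why the ABS search works on vectors that
are shorter by the decimation factor.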

2.1 The Performance Of CELP-BB 

     Although the encoder of CELP-BB appears very similar to
CELP [1], it in fact offers one very significant advantage,
namely simplicity. As the ABS procedure operates on the
base-band, the vector dimensions are reduced by a factor equal
to the decimation factor. Also, the ABS procedure is restricted
to the pitch synthesis filter and noise shaping filter. This
eliminates a considerable amount of the convolution computation
required by the LPC filter in standard CELP. These two
differences account for the enormous savings in computation.
Although CELP-BB is a much simpler coding scheme, its speech
quality remains comparable to CELP [1].

     Although CELP-BB produces good quality speech at around
4.8 kbits/s with a complexity that can be implemented on a
single chip [3], it has a fundamental problem in updating the
pitch filter memory. The worst case arises when the first and
the last (third) decimated sequences are selected in two
consecutive sub-blocks, as shown in figure 2.1.


[Figure 2.1 labels: 1. Dec. Sequence, 3. Dec. Sequence;
1. Sub-Block, 2. Sub-Block]

Figure 2.1 The representation of selected sequences in worst case.


     In this figure, 1's represent the selected samples of each
sub-block, which have the highest energy compared with the other
sequences, represented by 0's. In this case, the first and the
third decimated sequences are selected for transmission for the
first and second sub-blocks respectively. When the selected
sequences, each of which contains fewer samples than the
original by a factor equal to the decimation factor, are treated
as a continuous signal, a time offset occurs in the third
sequence of the second sub-block in the worst case shown in
figure 2.1. Since the pitch filter operates on the decimated
sequence, this situation disturbs the pitch structure.
Therefore, a method has to be found to update the pitch filter
memory efficiently and accurately. One solution to this problem
is to interpolate the decimated sequences in the pitch filter
memory. With this strategy, the pitch delay can be calculated by
maximizing equation 2.1,

$$E(p) = \sum_{i=1}^{N/D} h(iD + p)\, r(i) \qquad (2.1)$$

where N is the sub-block size, D is the decimation factor, p is
the pitch delay, which ranges from 20 to 147 samples, h(i) are
the interpolated samples of the pitch filter memory and r(i) is
the selected decimated sequence for the updated sub-block.
Although the performance improvement in this case depends on the
accuracy of the interpolation, we have noticed a partial
improvement with simple interpolation schemes.
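
As a rough illustration of the search implied by equation 2.1,
the sketch below evaluates the correlation for every candidate
delay and keeps the maximum. The array layout of h (indexed
directly by iD + p), the 0-based index i and the function name
are assumptions made for this sketch; the interpolation of the
memory itself is not modelled.

```python
import numpy as np

def pitch_delay_search(h, r, D, p_min=20, p_max=147):
    """Return the delay p in [p_min, p_max] that maximises
    E(p) = sum_i h[i*D + p] * r[i]  (equation 2.1 with 0-based i)."""
    h = np.asarray(h, dtype=float)
    r = np.asarray(r, dtype=float)
    idx = np.arange(len(r)) * D          # positions i*D of the decimated samples
    best_p, best_E = p_min, -np.inf
    for p in range(p_min, p_max + 1):
        E = float(np.dot(h[idx + p], r))
        if E > best_E:
            best_p, best_E = p, E
    return best_p, best_E

# Synthetic check: plant a copy of r in an otherwise empty memory.
rng = np.random.default_rng(0)
D, n_dec, true_p = 3, 20, 57
r = rng.standard_normal(n_dec)
h = np.zeros((n_dec - 1) * D + 147 + 1)  # long enough for every i*D + p
h[np.arange(n_dec) * D + true_p] = r
p_hat, _ = pitch_delay_search(h, r, D)
```

In this synthetic case the search recovers the planted delay,
because the correlation of r with itself exceeds its correlation
with any shifted copy.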

     Below 4.8 kbits/s, however, CELP-BB suffers from noticeable
quantization noise, background noise and hence roughness as a
result of the large vector dimensions used. The reason for this
is that during vector quantization of the excitation, all of the
components are matched as a single vector. This results in poor
matching performance for larger vector dimensions. Therefore, we
have introduced Sinusoidal Transform Coding (STC) [5] to
represent the LPC excitation. With this it is possible to code
speech below 4.8 kbits/s with minimal quality loss. This system
is called Sine Wave Excited LPC (SWELP). The results for this
system are presented in the following sections.

3. SWELP CODING SYSTEM

     In the speech production model, the speech waveform, S(n),
is assumed to be the output of passing a glottal excitation
waveform, r(n), through a linear time-varying system with
impulse response h(n) representing the characteristics of the
vocal tract. Mathematically, this can be written as,

$$S(n) = \sum_{m=0} r(m)\, h(n-m) \qquad (3.1)$$

The excitation is represented by a sum of sine waves of
arbitrary amplitudes, frequencies and phases,

$$r^{m}(n) = \sum_{k=1}^{L^{m}} A_k^{m} \cos(n\,\omega_k^{m} + \theta_k^{m}) \qquad (3.2)$$

where $L^{m}$ is the number of sine waves and $A_k^{m}$,
$\omega_k^{m}$ and $\theta_k^{m}$ are the amplitude, frequency
and phase respectively of the k-th sine wave component at the
m-th frame.
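
For one frame, the sum of sine waves in equation 3.2 can be
written in a few lines. The function name and the use of NumPy
are assumptions of this sketch; frequencies are taken to be in
radians per sample.

```python
import numpy as np

def sine_wave_excitation(amps, freqs, phases, n_samples):
    """r(n) = sum_k amps[k] * cos(n * freqs[k] + phases[k]), n = 0..n_samples-1."""
    amps = np.asarray(amps, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    phases = np.asarray(phases, dtype=float)
    n = np.arange(n_samples)[:, None]     # column of sample indices
    return np.sum(amps * np.cos(n * freqs + phases), axis=1)

# One unit-amplitude component at a quarter of the sampling rate.
r = sine_wave_excitation([1.0], [np.pi / 2], [0.0], 4)
```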

A block diagram of the SWELP 
Analysis/Synthesis system is given in figure 3.1 
which operates as follows: 

3.1 Analysis 

     A block of speech, S(n), is first LPC analyzed to obtain
the LPC coefficients. These coefficients are then quantized. The
quantized coefficients are used to form an inverse filter to
derive the LPC excitation, r(n), in sub-blocks,

$$r(n) = S(n) - \sum_{k=1}^{p} a_k\, S(n-k) \qquad (3.3)$$

where the $a_k$ are the LPC filter coefficients and p is the
order of the filter. The spectrum of the LPC excitation is then
analyzed using a 512 point FFT and a Hamming window with a
minimum width of 2.5 times the average pitch period for accurate
peak estimation. The spectrum, $R(\omega_k)$, can be computed
as,

$$R(\omega_k) = \sum_{n=0}^{N-1} r(n)\, W(n)\, e^{-jn\omega_k} \qquad (3.4)$$

where W(n) is a Hamming window and $\omega_k$ are the discrete
frequencies ($\omega_k = 2\pi k / N$).

     The number of peaks, $L^{m}$, is typically about 40 to 50
over a 4 kHz range. The maximum number of peaks that can be
specified is limited by a threshold that is also a function of
the measured average fundamental. In general, the performance is
affected by the choice of this threshold only when too few peaks
are allowed. The locations of the largest peaks are estimated by
simply searching for a change of slope from positive to negative
in the uniformly spaced samples of the short-time Fourier
transform magnitude, $|R(\omega_k)|$. The amplitude and phase
components (modulo $2\pi$) of the sine waves are given by the
appropriate samples of the high resolution FFT corresponding to
$R(\omega_k)$ at the chosen frequency locations.
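
The analysis steps above (inverse filtering, windowed FFT and
slope-change peak picking) can be sketched as follows. This is a
minimal illustration under stated assumptions: the function
names are hypothetical, the amplitude scaling by 2/sum(window)
is a common convention rather than something given in the text,
and no peak-count threshold is applied.

```python
import numpy as np

def lpc_residual(s, a):
    """Equation 3.3: r(n) = s(n) - sum_k a[k-1] * s(n-k), with s(n) = 0 for n < 0."""
    s = np.asarray(s, dtype=float)
    r = s.copy()
    for k, ak in enumerate(a, start=1):
        r[k:] -= ak * s[:-k]
    return r

def pick_peaks(r, n_fft=512):
    """Hamming-window r, take an n_fft-point FFT and return the frequency,
    amplitude and phase at every bin where the magnitude slope changes
    from positive to negative."""
    w = np.hamming(len(r))
    R = np.fft.rfft(np.asarray(r, dtype=float) * w, n_fft)
    mag = np.abs(R)
    d = np.diff(mag)
    peaks = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    freqs = 2 * np.pi * peaks / n_fft            # radians per sample
    amps = 2 * mag[peaks] / np.sum(w)            # assumed amplitude scaling
    phases = np.angle(R[peaks])                  # already within (-pi, pi]
    return freqs, amps, phases

# Unit-amplitude tone centred on bin 64, 160 samples
# (20 ms if the sampling rate is 8 kHz, as 240 samples per 30 ms implies).
n = np.arange(160)
tone = np.cos(2 * np.pi * 64 / 512 * n)
freqs, amps, phases = pick_peaks(tone)
strongest = int(np.argmax(amps))
```

Window side lobes also produce small local maxima, which is one
reason a threshold on the number of peaks is useful in practice.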

3.2 Synthesis 

     The LPC coefficients and the set of amplitudes, frequencies
and phases estimated in the encoder are transmitted to the
decoder. The received amplitudes, frequencies and phases are
used to generate a sine wave for each tone. The generated sine
waves are then added together to form the LPC excitation, r(n),
using equation 3.2. The final quantized speech, S(n), is then
obtained by passing the recovered LPC excitation through the LPC
filter,

$$S(n) = r(n) + \sum_{k=1}^{p} a_k\, S(n-k) \qquad (3.5)$$

     Due to the time-varying nature of the parameters, however,
this straightforward approach leads to discontinuities at the
frame boundaries if it is applied directly to speech as in [5].
Therefore, a method was found which smoothly interpolates the
parameters measured in one block to those obtained in the next.
Although the recovered speech in [5] is interpolated, this
interpolation procedure introduces some background noise and
block edge effects in the recovered speech. In the SWELP system,
we use the LPC filter to interpolate the sine wave components.
In this way all the discontinuities are eliminated from the
recovered speech, at the cost of coding the LPC coefficients.
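
The recursion in equation 3.5 is the inverse of the analysis
filter in equation 3.3. A direct, unoptimised sketch is given
below; the function name and the zero initial filter state are
assumptions made for the illustration.

```python
import numpy as np

def lpc_synthesis(r, a):
    """Equation 3.5: S(n) = r(n) + sum_k a[k-1] * S(n-k), zero initial state."""
    r = np.asarray(r, dtype=float)
    s = np.zeros_like(r)
    for n in range(len(r)):
        acc = r[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * s[n - k]      # feedback from past output samples
        s[n] = acc
    return s

# Impulse response of a first-order filter with a_1 = 0.5.
impulse_response = lpc_synthesis([1.0, 0.0, 0.0, 0.0], [0.5])

# Excitation obtained by applying equation 3.3 to [1, 2, 3, -1] with a = [0.5, -0.2].
speech = lpc_synthesis([1.0, 1.5, 2.2, -2.1], [0.5, -0.2])
```

Because each output sample depends on the previous outputs, the
filter carries its state across sample boundaries, which is the
property relied on above to smooth the sine wave components.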

4. BIT RATE REDUCTION STRATEGIES 

     Since the parameters of the SWELP system are the LPC
coefficients and the amplitudes, frequencies and phases of the
underlying sine waves, and since for a typical low-pitched
speaker there can be as many as 80 sine waves in a 4 kHz speech
bandwidth, it is not possible to code all of the parameters
directly. Therefore, an important focus of this work has been on
techniques for efficient coding of the model parameters. The
first step in reducing the size of the parameter set to be coded
was to develop a pitch extraction algorithm which leads to a
harmonic set of sine waves. The computed harmonic set is a
perceptual best fit to the measured sine waves. With this
strategy, coding of individual sine wave frequencies is avoided.
A new set of sine wave amplitudes and phases is then obtained by
sampling an amplitude and phase envelope at the pitch harmonics.

     In addition to the pitch extraction algorithm, which led to
a harmonic set of sine waves, a predictive model for the phases
of the sine waves was developed. The model given in [6] is quite
accurate during steady voiced speech, while during unvoiced
speech it is poor, resulting in excitation phases that appear to
be random values within $[-\pi, \pi]$. For this reason, we have
developed another phase prediction model which works in both
voiced and unvoiced speech regions. Since the speech coder is
independent of the v/uv decision, another parameter, called the
error compensation component, has to be coded and transmitted to
the receiver. The recovered LPC excitation is then represented
as,

$$r(n) = \sum_{k=0} A_k \cos\big((n - n_0)\,k\,\omega_0 + \Phi_k\big) \qquad (4.1)$$

where $\Phi_k = (\Phi_0)^{k+1}$ is the phase and frequency error
compensation component, $n_0$ is the onset time of the pitch
pulse [6] and $\omega_0$ is the fundamental frequency.
Experiments showed that during steady voiced speech $\Phi_0$ was
automatically selected as zero, whereas during unvoiced speech
$(\Phi_0)^{k+1}$ took a value within $[-\pi, \pi]$. Since the
amplitudes of the LPC excitation sine waves are more or less
flat, a good criterion to use in seeking the optimum values of
$n_0$ and $\Phi_0$ is the minimum mean-squared error.
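
The role of the onset time in the harmonic model of equation 4.1
can be seen in a small sketch. Everything named here is an
assumption made for illustration: the function name, the
`k_start` argument for the first harmonic index, and the fact
that the compensation phases are simply passed in as an array
rather than derived from a prediction model.

```python
import numpy as np

def harmonic_excitation(amps, w0, n0, phi, n_samples, k_start=0):
    """r(n) = sum_k amps[k] * cos((n - n0) * k * w0 + phi[k]),
    with harmonic numbers k = k_start, k_start + 1, ..."""
    amps = np.asarray(amps, dtype=float)
    phi = np.asarray(phi, dtype=float)
    k = k_start + np.arange(len(amps))    # harmonic numbers
    n = np.arange(n_samples)[:, None]     # column of sample indices
    return np.sum(amps * np.cos((n - n0) * k * w0 + phi), axis=1)

# Flat amplitudes and zero compensation phases (the steady voiced case):
# every harmonic peaks together at the onset time n0 and one period later.
r = harmonic_excitation(np.ones(10), w0=2 * np.pi / 40, n0=7,
                        phi=np.zeros(10), n_samples=80)
```

With zero phases the components add coherently at n = n0, giving
a pulse-like excitation; non-zero phases spread the energy in
time, which is closer to a noise-like excitation.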

     A block diagram of the complete analysis/synthesis system
is given in figure 3.1. A non-real-time floating point
simulation was developed in order to determine the effectiveness
of the proposed approach in modeling real speech. In the SWELP
system, the LPC analysis and the spectrum analysis take place on
a block by block and a sub-block by sub-block basis
respectively. In the LPC analysis, using 30 ms (240 sample)
block intervals (each consisting of 2 sub-blocks), 7 or 8 LPC
coefficients were found to be sufficient for smoothly
interpolating the sine wave components. A 512 point DFT using a
20-22 ms Hamming window was found to be sufficient for accurate
peak estimation for each sub-block of LPC excitation. The
overall bit rate of the SWELP system is chiefly determined by
the number of bits allocated to the LPC coefficients and sine
wave components for each block of speech. We therefore feel
that, by allocating more sine wave components to the excitation
representation, the overall bit rate can be varied from 2.4 to
4.8 kbits/s or higher. This also varies the quality of the
speech. Thus, this scheme can easily be operated in a variable
rate environment if required. A 2.4 kbits/s SWELP coder was
simulated using vector quantization for both the LPC
coefficients and the amplitude components of the sine waves. The
bit allocation for the coder implementation is shown in table
4.1, and the waveforms of the original and decoded speech are
shown in figure 4.1. In this case, the 2.4 kbits/s SWELP system
produces natural sounding, good quality synthetic speech.


Parameter                    Bits Per Frame    Bit Rate (bits/s)

LPC Coeff.                         13               500.0
Fund. Freq.                        14               400.0
Phases ($n_0$, $\Phi_0$)           18               600.0
Amplitudes                         27               900.0
Overall                            72              2400.0


Table 4.1 Bit allocations for 2.4 kbits/s SWELP coder.
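
The overall figures in table 4.1 can be checked with a short
calculation, assuming one frame is the 30 ms block used for the
LPC analysis. The variable names are chosen for this sketch
only.

```python
FRAME_MS = 30  # block length stated in the text (240 samples per block)

# Bits per frame as printed in table 4.1.
bits_per_frame = {
    "LPC Coeff.": 13,
    "Fund. Freq.": 14,
    "Phases": 18,
    "Amplitudes": 27,
}

total_bits = sum(bits_per_frame.values())
total_rate = total_bits * 1000 / FRAME_MS      # bits per second
print(total_bits, total_rate)
```

Under this assumption 72 bits per 30 ms frame gives the overall
2400 bits/s. Row by row, 18 and 27 bits correspond to the
printed 600 and 900 bits/s, while the first two rows match the
printed rates only as a pair (27 bits, 900 bits/s).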


5. CONCLUSION 

     In this paper, the advantages of CELP-BB, its problems and
their solutions were examined. We saw that interpolating the
decimated sequences in the pitch filter memory improved the
subjective performance of the CELP-BB system. Below 4.8 kbits/s,
however, efficient representation of the model parameters for
good quality speech proved very difficult. Therefore, we
presented the results of the SWELP system, which represents the
LPC excitation at very low bit rates (around 2.4 kbits/s) and
produces good quality speech. The strategies for bit rate
reduction in the transmission parameters were described.
Depending on the detailed bit allocation rules, operation at
rates from 2.4 to 9.6 kbits/s can be obtained, with a
corresponding variation in speech quality. Thus, this scheme can
easily be operated in a variable rate environment if required.
Finally, a 2.4 kbits/s SWELP system was simulated and the bit
allocation of its parameters was given as an example. The speech
produced by this system was found to be very intelligible and
natural sounding. Currently, the complete analysis/synthesis
blocks of the SWELP scheme are being modified to form an
Analysis By Synthesis (ABS) selection procedure for a reduced
set of amplitudes, frequencies and phases. Results of this
modification will be published later.